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Implementing instruction digrams in Zoea turns out to be a simple matter of 
creating additional instruction subsets. These are produced up-front, meaning that 


instruction digrams have zero additional processing overhead at run-time. 
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The biggest problem that has always faced inductive programming is the enormous size of 
the search space. This is because most programming languages contain hundreds of 
instructions that can be combined in a multitude of different ways. As the size of target 
programs increase, the corresponding search space grows exponentially. This has led 
many people to conclude that the problem is basically impossible to solve. Fortunately, this 


is an incorrect assessment. 
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While programming languages may contain hundreds of instructions, the vast majority of 
individual programs use relatively few of them - often less than twenty. Therefore, if it were 
somehow possible to predict the instructions that are actually required for a given problem, 
then inductive programming would be much more tractable. This intuition led to the 
concept of instruction subsets in Zoea, where many subsets correspond to different 


predictions of the required instructions. 


Human coders don't use all instructions with equal frequency and the co-occurrence of 
instructions within programs is highly skewed. This is effectively tacit programming 
knowledge and we can capture it by creating instruction subsets from a large sample of 
existing code. Zoea uses these subsets to reduce the size of the search space, often by 


tens of orders of magnitude. 


Support for instruction subsets was designed into Zoea from the very outset. This means 
that every Zoea knowledge source can be activated with respect to any subset - which can 


be a single instruction, the complete instruction set or anything between. 


Instruction subsets specify the combinations of instructions that can be used together to 
form programs. However, they say nothing about the actual structure of programs. 


Instruction digrams on the other hand prescribe how instructions can be assembled. 


Each digram is simply a pair of instruction identifiers, which signify that the output of 
instruction 1 can be used as the input to instruction 2. If an instruction has multiple inputs 


then these correspond to different digrams. 


Both instruction subsets and digrams can be viewed as constraints that apply to all 
generated programs. Each candidate solution only uses instructions from one subset, 
although very often it will use just some of them. Similarly, each instruction application in 


every program will correspond to one of the digrams. 


The frequency distribution of instruction digrams in human code is also highly skewed and 
this is what makes them useful. Coders combine instructions in relatively few ways. 
Therefore, Zoea can potentially avoid the many combinations that people never use. As 


with subsets, the instruction digrams used in Zoea were also derived from a large code 
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sample. Creating digrams is one thing but they also need to be integrated into Zoea in 


some way. 


Implementing instruction digrams on top of Zoea's existing instruction subset mechanism 
could be accomplished in a number of ways. For example, we might simply enforce every 
digram as a constraint, for each instruction that we add to a candidate solution. Whilst this 
approach would work, the high amount of additional processing involved would seriously 
impact performance. A better approach is to apply instruction digrams through pre- 


processing. 


One way to think about instruction digrams is in terms of their impact on instruction 
subsets. Although instruction digrams are primarily intended to describe how instructions 
can be put together, the side effect of applying instruction digrams to any subset is to 


create another subset. 


In data flow terms, we can think of every instruction in a program as being some distance 
from the input. Instructions that use input directly are level 1, while those that use input 


with one level of indirection are level 2, and so on. 


For any given subset, let us assume that all subset instructions could potentially be applied 
to program input. These level 1 instructions also dictate which digrams will apply, since 
instruction 1 in the digram needs to be one of these instructions. Clearly, instruction 2 in 
the digram also needs to be in the subset, but each of these is also an instruction at level 
2. In effect, applying the digrams determines the possible instructions at level 2, and while 


this new subset can be the same size, it is frequently smaller. 


We can repeat this process for any number of levels, producing different subsets for each 
level. At some stage, usually between levels 3 and 10, the subset size stabilises. What we 
end up with are variants of all instruction subsets for each data flow level. These can easily 


be pre-computed. 


In effect, we have compiled all of the instruction digrams to produce a larger number of 
subsets. However, this is not a problem since Zoea already applies an instruction subset at 
each level, so it doesn't matter if the subsets are different. Also, implementing digrams in 


this way requires only a trivial change in the Zoea codebase. 
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With this approach, Zoea doesn't have to do any extra work at run-time in order to enforce 
instruction digrams. Rather, as we have noted previously, digrams provide a further huge 
reduction in the size of the search space. As a bonus, since digram compilation transforms 
digrams to subsets, this approach is applicable to other types of production system, such 


as rules, rewriting systems and grammars. 


Learn more at zoea.co.uk 
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